Statistical Learning and Data Mining

Tutorial 2: Linear Regression, K-Nearest Neighbours and AutoML


In this tutorial we will discuss how to apply the linear regression and k-Nearest Neighbours methods in Python. We will also do some exploratory data analysis and use TPOT, a useful automated machine learning tool.

Ames housing dataset
Some exploratory data analysis
Preparing the data for machine learning
Linear Regression
k-Nearest Neighbours
AutoML with TPOT
Validation results

This notebook relies on the following imports and settings. We load other libraries in context to make it clear what we are using them for.
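The core imports and settings referred to above are not shown here; a minimal sketch of what they typically look like (the exact display and figure settings are assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Show all columns of wide data frames and use a consistent figure size
pd.set_option("display.max_columns", 100)
plt.rcParams["figure.figsize"] = (8, 5)
```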

1. Ames Housing Data

We use the Ames Housing dataset from De Cock (2011), which contains data about residential property sales in a North American city.

The dataset has 81 predictor variables of all standard types (continuous, discrete, nominal, and ordinal). Check the documentation for a description of the dataset.

The objective is to predict the sale price (the last column in the dataset). Our metric for evaluating performance will be the root mean squared error on the log scale. That implies that we care about the percentage errors in the predictions (approximately), rather than the errors measured in dollars.
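The evaluation metric can be sketched as a small helper (the function name is ours, not from the tutorial). Because the errors are measured on the log scale, a given error corresponds approximately to a fixed percentage of the sale price:

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """Root mean squared error on the log scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))

# A 10% over-prediction contributes an error of log(1.1) ≈ 0.095,
# regardless of whether the house costs $100k or $1m
rmse_log([100_000], [110_000])
```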

To keep things simpler this time, we only consider two predictors: ground living area and the number of garage spots. Next week, we'll study how to work with many predictors of all types.

2. Some exploratory data analysis

Exploratory data analysis (EDA) is about understanding our data. Some of the goals of EDA are to discover useful information for building better machine learning models, identify potential problems to be addressed, and uncover interesting insights.

We'll talk more about EDA next week, but let's start doing it now.

2.1 Response

Today we'll be using the dataprep package to assist us with EDA. Click on the ! symbol that appears on some of the plots and it will show you some insights about the data.

The distribution of sale prices is very right-skewed, with some outliers.

We can apply a log transformation to make a right-skewed variable closer to symmetrically distributed. Compare the two KDE and normal Q-Q plots.
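A quick way to see the effect of the log transformation is to compare skewness before and after. The prices below are illustrative stand-ins, not values from the Ames data:

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed prices; the real notebook uses the SalePrice column
prices = pd.Series([120_000, 135_000, 150_000, 180_000, 250_000, 755_000])
log_prices = np.log(prices)

# Skewness shrinks markedly after the log transformation
print(prices.skew(), log_prices.skew())
```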

2.2 Predictors

The ground living area is also very right-skewed. We'll leave this variable as is today, but we'll see later that, depending on the model to be trained, it can be useful to transform skewed input variables too.

The GarageCars column has a missing value. Based on studying the full dataset, this missing value occurs for a house that does not have a garage. Therefore, we replace it with a zero in the following cell.
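The replacement can be sketched like this, on a toy frame standing in for the Ames data (the column name matches the dataset; the values are illustrative):

```python
import pandas as pd

# Toy frame standing in for the Ames data
df = pd.DataFrame({"GarageCars": [2.0, 1.0, None, 3.0]})

# The house with the missing value has no garage, so zero is the correct fill
df["GarageCars"] = df["GarageCars"].fillna(0)
```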

2.3 Bivariate Relationships

We now examine the pairwise relationships between the response variable and the predictors.

Plotting the sale price against the ground living area reveals a funnel-shaped pattern of the type that we discussed in the lecture, indicating non-constant error variance. This tends to occur together with non-linearity and skewed errors.

We also observe pronounced outliers in the plot, which is a cause for concern because they can have a disproportionate effect on our trained models.

As discussed in the lecture, a log transformation of the response can mitigate issues of non-constant error variance, non-linearity and skewed errors, though we're still left with some outliers in this case.

The next plot shows the relationship between the log sale price and the number of garage spots. It suggests a moderately strong association. Perhaps surprisingly, the Pearson correlation coefficient (0.675) is almost as high as that for the ground living area (0.696).

Note that there are very few houses with more than three garage spots.

3. Preparing the data for machine learning

To be able to compare the predictive performance of different methods, we randomly split the data into training and validation sets. We'll use the training set to fit the models and predict the observations in the validation set.

We use the Scikit-Learn train_test_split function to split the data.

Below, we specify that the training set will contain 70% of the data. The random_state is an arbitrary number. If we run the analysis again with the same random_state we will get the same training and validation sets, even though the data split is random. This is important because we should always be able to reproduce our work.
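A sketch of the split on toy data (the variable names and the random_state value are our assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with the two predictors used in this tutorial
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "GrLivArea": rng.integers(500, 3000, size=100),
    "GarageCars": rng.integers(0, 4, size=100),
})

# 70% training data; fixing random_state makes the random split reproducible
train, val = train_test_split(data, train_size=0.7, random_state=1)
print(len(train), len(val))
```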

Next, we build the design matrices and response vectors for the training and validation sets. Because the k-Nearest Neighbours method will require all predictors to be on the same scale, we use the StandardScaler from scikit-learn to standardise the predictors. Each resulting column has a sample mean of zero and a standard deviation of one on the training set.
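The key point with StandardScaler is to fit it on the training set only and reuse the fitted means and standard deviations on the validation set. A minimal sketch with made-up design matrices:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy design matrices (ground living area, garage spots)
X_train = np.array([[1500., 2.], [900., 1.], [2100., 3.], [1200., 0.]])
X_val = np.array([[1800., 2.]])

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on the training set only
X_val_s = scaler.transform(X_val)          # reuse the training means/SDs

# Each training column now has mean 0 and standard deviation 1
print(X_train_s.mean(axis=0), X_train_s.std(axis=0))
```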

4. Linear Regression

4.1 Fitting a linear regression

Scikit-learn allows us to train and use a wide range of machine learning algorithms using a simple API:

  1. Import the learning algorithm.
  2. Specify the model and configuration.
  3. Train the model.
  4. Use the trained model to make predictions.

We use the LinearRegression class from Scikit-Learn to fit a linear regression.

To compute predictions, we simply do the following:
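The four steps of the scikit-learn API can be sketched on toy data (the values are illustrative, standing in for the standardised predictors and log sale prices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # 1. import the algorithm

# Toy standardised predictors and log sale prices
X_train = np.array([[0.5, 1.0], [-1.0, 0.0], [1.5, -1.0], [-1.0, 0.0]])
y_train = np.array([12.2, 11.6, 12.5, 11.7])

model = LinearRegression()       # 2. specify the model and configuration
model.fit(X_train, y_train)      # 3. train the model
y_pred = model.predict(X_train)  # 4. use the trained model to predict
```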

4.2 Residual diagnostics

Residual diagnostics helps us to check whether our model is appropriate for the data and find ways to improve it.

We compute the residuals as follows.
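The residuals are simply the observed responses minus the fitted values, here on the log scale (toy numbers for illustration):

```python
import numpy as np

# Observed and fitted log sale prices (illustrative values)
y_train = np.array([12.2, 11.6, 12.5, 11.7])
y_fit = np.array([12.1, 11.7, 12.4, 11.8])

# Residual = observed minus fitted
residuals = y_train - y_fit
```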

First, we investigate the distribution. As we might have expected based on the EDA, the model is subject to a few outliers. The outliers and perhaps other factors make the distribution of the residuals very left-skewed, which is not ideal. We'll discuss some tools to handle outliers next week.

Plotting the fitted values against the residuals suggests no major additional issues besides the outliers.

There may be some weak nonlinearity in the relationship between ground living area and the residuals. It's probably best to transform this predictor as we'll do next week.

5. k-Nearest Neighbours

We instantiate the kNN method based on the Euclidean distance as follows. We try two different configurations for the number of neighbours.
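A sketch on toy one-dimensional data (the choice of 5 and 15 neighbours follows the tutorial; Euclidean distance is scikit-learn's default, since it is Minkowski distance with p=2):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data: 20 evenly spaced points with a linear response
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()

# Two configurations for the number of neighbours
knn5 = KNeighborsRegressor(n_neighbors=5).fit(X, y)
knn15 = KNeighborsRegressor(n_neighbors=15).fit(X, y)

# Predictions average the responses of the nearest neighbours
print(knn5.predict([[10.0]]), knn15.predict([[10.0]]))
```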

Let's also try the kNN method with the Mahalanobis distance. First, we compute the covariance matrix of the predictors.

We then instantiate the model as follows.
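In scikit-learn, the Mahalanobis metric takes the inverse covariance matrix through the VI entry of metric_params; the toy data below is an assumption for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy correlated predictors and a simple response
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X[:, 0] + X[:, 1]

# Inverse covariance matrix of the predictors
VI = np.linalg.inv(np.cov(X, rowvar=False))

knn_m = KNeighborsRegressor(
    n_neighbors=15,
    metric="mahalanobis",
    metric_params={"VI": VI},
    algorithm="brute",  # brute-force search supports this metric directly
)
knn_m.fit(X, y)
```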

Later in the unit we'll discuss how you can systematically select the number of neighbours.

6. AutoML with TPOT

Automated Machine Learning (AutoML) is an area of machine learning that seeks to automate one or multiple steps in the development of machine learning pipelines, for example data preparation and model selection.

The promise of AutoML is to allow even non-experts to build useful machine learning models with minimal effort. AutoML is a very active area of research in machine learning, and there has been progress in recent years.

Because the demand for machine learning experts outstrips supply, there are strong incentives for the development of AutoML frameworks. Well-known commercial products include DataRobot AutoML, Google Cloud AutoML, and H2O Driverless AI.

Fortunately, there are also excellent open-source AutoML tools that can increase your productivity wherever you are in your machine learning journey.

Today we'll use TPOT, which uses an optimisation technique called genetic programming to search over scikit-learn pipelines.

The higher the number of generations and the population size, the better TPOT will tend to work. However, it will take longer to run. The settings above are well below the default of 100 so that the cell doesn't take too long to run.

7. Validation results

Finally, we compare all the models on the validation set. The model found by TPOT has the highest predictive accuracy. The kNN variants with 15 neighbours are not far behind.
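The comparison amounts to computing the validation RMSE, on the log scale, for each model's predictions. A sketch with hypothetical predictions (the numbers below are made up, not the tutorial's results):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical validation log sale prices and model predictions
y_val = np.array([12.0, 11.9])
preds = {
    "linear": np.array([12.1, 11.8]),
    "knn15": np.array([12.0, 11.9]),
}

# Validation RMSE for each model; lower is better
scores = {name: np.sqrt(mean_squared_error(y_val, p))
          for name, p in preds.items()}
print(scores)
```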